Prosper Loan Data Exploration by Yeqing Zhang

## [1] 113937     81
##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6          :84984   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   C      : 5649   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   D      : 5153   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   B      : 4389   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   AA     : 3509   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   HR     : 3508   Max.   :60.00  
##  (Other)                      :113912   (Other): 6745                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576                      :58848  
##  Completed            :38074   2014-03-04 00:00:00:  105  
##  Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other)              : 1108   (Other)            :54633  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000                  :29084         Min.   : 1.00  
##  1st Qu.:3.000           C      :18345         1st Qu.: 4.00  
##  Median :4.000           B      :15581         Median : 6.00  
##  Mean   :4.072           A      :14551         Mean   : 5.95  
##  3rd Qu.:5.000           D      :14274         3rd Qu.: 8.00  
##  Max.   :7.000           E      : 9795         Max.   :11.00  
##  NA's   :29084           (Other):12307         NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           False:56459         False:101218    
##  1st Qu.: 26.00           True :57478         True : 12719    
##  Median : 67.00                                               
##  Mean   : 96.07                                               
##  3rd Qu.:137.00                                               
##  Max.   :755.00                                               
##  NA's   :7625                                                 
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Introduction

This data exploration covers a large dataset of 113,937 loans from Prosper Marketplace. The main goal of this exercise is to analyze which factors affect the Prosper Score, and how. The Prosper Score ranges from 1 to 11 and indicates the risk of a Prosper borrower listing (1 is the worst, i.e. the highest risk). I will start by looking at the distribution of each variable of interest. I will then explore the relationships between variables and how they correlate with the Prosper Score. Finally, a model based on the findings will be created to predict the Prosper Score.

Univariate Plots Section

Prosper Score

We start by looking at Prosper Score, the target variable we are most interested in. Let's see what its distribution looks like.

plot of chunk ProsperScore_hist

About a quarter of the loans (29,084 of 113,937) are missing Prosper Score values. Let's remove them and redraw the histogram.
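The filtering step can be sketched as follows. This is a minimal illustration, not the author's chunk: `loans` here is a ten-row stand-in built from the first ProsperScore values shown in the `str()` output above, and the variable names `missing_share`, `scored` and `score_counts` are assumptions introduced for the example.

```r
# Ten-row stand-in for the full data frame (values taken from the str() output).
loans <- data.frame(ProsperScore = c(NA, 7, NA, 9, 4, 10, 2, 4, 9, 11))

# Share of rows with no Prosper Score.
missing_share <- mean(is.na(loans$ProsperScore))

# Keep only scored loans, then count loans per score (what the histogram displays).
scored <- subset(loans, !is.na(ProsperScore))
score_counts <- table(scored$ProsperScore)
print(score_counts)
```

On the full dataset the same two lines would drop the 29,084 rows reported as NA in the summary.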

plot of chunk ProsperScore_hist2

Most Prosper Scores fall between 4 and 8. The most common score is 4.

plot of chunk CreditGrade_hist

There are quite a lot of missing (empty) values in CreditGrade: 84,984 of the 113,937 loans. Again, let's remove them and redraw the plot.

plot of chunk CreditGrade_hist2

After removing the missing category, the number of loans increases as CreditGrade decreases down to “C”, then drops as the grade decreases further. There are very few loans with a CreditGrade of “NC”. Why do most loans carry a medium grade such as “C” rather than the highest grade, “AA”? Does this have anything to do with the interest, yield or return on investment of the loans? I wonder what the distributions of loan interest rates and returns look like.
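The rise-then-fall pattern can be checked against the counts printed by `summary()` above. The sketch below assumes the grade scale runs from best to worst as AA, A, B, C, D, E, HR, NC; that ordering is an assumption, not something stated in the output. Grades A, E and NC are lumped into "(Other)" in the summary, so they are left out here.

```r
# Counts reported by summary() for the five most frequent non-empty grades.
reported <- c(C = 5649, D = 5153, B = 4389, AA = 3509, HR = 3508)

# Assumed ordering of the grade scale, best to worst.
grade_order <- c("AA", "A", "B", "C", "D", "E", "HR", "NC")

# Reorder the reported counts along that scale.
ordered_counts <- reported[intersect(grade_order, names(reported))]
print(ordered_counts)

# Grade with the most loans among those reported.
peak_grade <- names(which.max(ordered_counts))
```

With the real data, setting the same levels on the factor (`factor(CreditGrade, levels = grade_order)`) would put the bar chart in this order.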

Interest Rates, Loss and Return

Borrower APR

Let's start by looking at Borrower APR and plot its histogram.

plot of chunk BorrowerAPR_hist

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

Borrower APR has a roughly bell-shaped distribution, centered at 0.18. However, it appears to be a mixture of several distributions, as there seem to be three other spikes at 0.09, 0.29 and 0.37. Let's take a closer look by narrowing the bin width.

plot of chunk BorrowerAPR_hist2

A finer-grained histogram shows a similar trend but with a prominent peak at 0.37. Why are there so many loans with a Borrower APR of 37%? The distribution also seems to be multi-modal, with a notable peak at the low end, around 0.09.

Lender Yield

Lender Yield should be similar to Borrower APR, because Lender Yield is defined as the interest rate on the loan minus the servicing fee, which is relatively small. Let's draw a histogram of it.

plot of chunk LenderYield_hist2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

This confirms that the distributions of Lender Yield and Borrower APR are very similar, except that Lender Yield is shifted to the left (median 0.173 versus 0.210).
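The size of the fee can be checked on the first five rows printed by `str()` above. The values below are copied from that output and are therefore rounded for display; `implied_fee` is a name introduced for this sketch.

```r
# First five rows as displayed by str(); values are rounded for display.
borrower_rate <- c(0.158, 0.092, 0.275, 0.0974, 0.2085)
lender_yield  <- c(0.138, 0.082, 0.24, 0.0874, 0.1985)

# Gap between the rate paid by the borrower and the yield received by the lender.
implied_fee <- borrower_rate - lender_yield
print(round(implied_fee, 4))
```

In these rows the gap is one percentage point for three loans and somewhat larger for the other two, which is consistent with a small shift to the left.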

Estimated Effective Yield

Let's also plot Estimated Effective Yield, as I think it is an important input to the return calculation.

plot of chunk estimated_effective_yield_hist

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.116   0.162   0.169   0.224   0.320   29084

The shape is again quite similar to that of Borrower APR, but the median is 0.162.

Estimated Loss

How about the loss on the loans? I think it is important, as it may affect the final return of the loans.

plot of chunk estimated_loss_hist

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.005   0.042   0.072   0.080   0.112   0.366   29084

The Estimated Loss distribution seems to be multimodal. Let's take a closer look.

plot of chunk estimated_loss_hist2

The distribution appears quite discrete, with prominent spikes at 0.02, 0.04, 0.08, 0.09, 0.15 and 0.17. Does this have something to do with Prosper Score, Credit Grade or other risk indicators that are tier/level based?
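One way to probe that question is to count the distinct Estimated Loss values within each rating tier. The sketch below uses the first five rows from the `str()` output above; the rating codes are decoded under the assumption that the eight factor levels are in alphabetical order ("", A, AA, B, C, D, E, HR). `first_rows`, `rated` and `distinct_losses` are names introduced for the example.

```r
# Assumed (alphabetical) factor levels for ProsperRating..Alpha.
rating_levels <- c("", "A", "AA", "B", "C", "D", "E", "HR")

# First five rows from str(): rating codes 1 2 1 2 6 and their EstimatedLoss values.
first_rows <- data.frame(
  ProsperRating = factor(rating_levels[c(1, 2, 1, 2, 6)], levels = rating_levels),
  EstimatedLoss = c(NA, 0.0249, NA, 0.0249, 0.0925)
)

# Drop unrated loans, then count distinct loss values per rating.
rated <- droplevels(subset(first_rows, ProsperRating != ""))
distinct_losses <- tapply(rated$EstimatedLoss, rated$ProsperRating,
                          function(x) length(unique(x)))
print(distinct_losses)
```

In this tiny sample both "A" loans share the same loss value. Five rows prove nothing, but running the same `tapply` on the full data would show whether each tier maps to only a handful of loss values.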

Estimated Return

After plotting the APR and the loss, we should examine the return of the loans, as this is related to the actual profitability of a loan.

plot of chunk estimated_return_hist

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

The Estimated Return appears to follow a skewed, roughly bell-shaped distribution, peaking at about 0.1, with a few outliers below 0.

plot of chunk estimated_return_hist2

The transformed EstimatedReturn seems to be multimodal peaking at 0.08, 0.11, 0.12, 0.14.

Original Loan Amount

After examining the Borrower APR, Lender Yield and Estimated Effective Yield/Loss/Return, I am going to look at the loan amount, i.e. how much a borrower can get from a loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The median original loan amount is $6,500 and the mean is $8,337. The minimum is $1,000 and the maximum is $35,000.

plot of chunk LoanOriginalAmount_hist

The majority of the loans are below $10,000. The distribution seems to be multi-modal, with peaks at 4000, 10000, 15000, 20000 and 25000. For small loans (< $4500), the number of loans increases with the loan amount. Above roughly $4500, however, the number of loans decreases approximately exponentially (apart from a few obvious spikes). Let's fine-tune the x axis.

plot of chunk LoanOriginalAmount_hist2

There seem to be quite a few spikes above the trend line, which indicates that the distribution is quite discrete, with most loan amounts at multiples of 5000. Is this a matter of convenience or some sort of convention?

Also, the most popular small loan amount is $4000 rather than the expected $5000.
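The share of loans sitting exactly on a multiple of 5000 can be measured with the modulo operator. The sketch below uses the ten LoanOriginalAmount values shown in the `str()` output above, so the resulting share describes only those rows; `is_multiple` and `share_multiple` are names introduced here.

```r
# First ten LoanOriginalAmount values from the str() output.
loan_amount <- c(9425, 10000, 3001, 10000, 15000, 15000, 3000, 10000, 10000, 10000)

# TRUE where the amount is an exact multiple of 5000.
is_multiple <- loan_amount %% 5000 == 0
share_multiple <- mean(is_multiple)
print(share_multiple)
```

Applying the same expression to the whole column would quantify how strongly the spikes dominate the distribution.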

There are very few loans with an original amount over $25,000. This is understandable: a larger loan means a larger loss if a default does happen, so fewer lenders are willing to fund large loans.

The distribution seems to consist of several negative exponential relationships between the number of loans and the original loan amount. Again, let's transform the plot a little to better understand the pattern.

plot of chunk LoanOriginalAmount_hist3

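With the x axis on a log scale, the number of loans decreases roughly linearly as LoanOriginalAmount increases, i.e. the count falls approximately in proportion to the logarithm of the amount. Note that a strictly negative exponential relationship would instead appear as a straight line with the y axis on a log scale.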
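To make the link between axis scaling and functional form concrete, the sketch below builds a purely synthetic exponential decay (not the Prosper data; `rate` and the scale factor 20000 are arbitrary assumptions) and shows that it becomes a straight line when the counts, not the amounts, are log-transformed.

```r
# Synthetic counts that decay exponentially with the loan amount (toy values).
amount  <- seq(5000, 35000, by = 5000)
rate    <- 1e-4
n_loans <- 20000 * exp(-rate * amount)

# Regressing log(count) on amount recovers the decay rate as the slope.
fit_log_y <- lm(log(n_loans) ~ amount)
slope <- unname(coef(fit_log_y)["amount"])
print(slope)
```

If the real histogram counts are linear against log(amount) instead, the relationship is logarithmic rather than exponential; fitting both forms to the binned counts would show which describes the data better.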

Income

It seems intuitive to me that monthly income plays an important role in determining the risk of a borrower: the higher the income, the less likely the borrower is to default. Let's start by plotting the histogram of income range.

plot of chunk income_range_hist

## Error in eval(expr, envir, enclos): could not find function "summarize_cat_var"

Just over half of the borrowers (55%) have an annual income in the $25K-$75K range, and 30% earn over $75K.
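The error message above shows that the helper `summarize_cat_var` was not defined when this chunk ran, so the table behind these percentages is missing. The author's definition is not shown; the version below is a hypothetical reconstruction, modeled on the `NumOfLoans` and `NumOfLoans.percent` columns that appear in the employment and occupation tables later in this report. The IncomeRange column is rebuilt from the counts printed by `summary()`, with "(Other)" kept as a single stand-in level.

```r
# Hypothetical reconstruction; the author's original helper is not shown.
summarize_cat_var <- function(df, var) {
  counts <- sort(table(df[[var]]), decreasing = TRUE)
  out <- data.frame(level = names(counts),
                    NumOfLoans = as.integer(counts),
                    stringsAsFactors = FALSE)
  names(out)[1] <- var
  out$NumOfLoans.percent <- out$NumOfLoans / nrow(df)
  out
}

# Counts as reported by summary(); "(Other)" lumps the remaining levels.
income_counts <- c("$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
                   "$100,000+" = 17337, "$75,000-99,999" = 16916,
                   "Not displayed" = 7741, "$1-24,999" = 7274,
                   "(Other)" = 1427)
loans <- data.frame(IncomeRange = rep(names(income_counts), income_counts),
                    stringsAsFactors = FALSE)

income_summary <- summarize_cat_var(loans, "IncomeRange")
print(income_summary)

# Shares quoted in the text.
share <- setNames(income_summary$NumOfLoans.percent, income_summary$IncomeRange)
mid_income_share  <- share[["$25,000-49,999"]] + share[["$50,000-74,999"]]
high_income_share <- share[["$75,000-99,999"]] + share[["$100,000+"]]
```

The two shares come out at roughly 0.555 and 0.301, matching the 55% and 30% quoted above.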

To further investigate income, I am going to look at StatedMonthlyIncome. Let's start by creating a summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

StatedMonthlyIncome has a median of 4667 and a mean of 5608. The maximum, however, is 1750000, which is clearly an extreme outlier.

I am going to remove the top 1% of monthly incomes to get a better-shaped distribution. Let's also put the x axis on a log scale to reduce the skew.

plot of chunk StatedMonthlyIncome_hist

Stated Monthly Income is approximately normally distributed on the log scale.

Let's look at the summary of the log10-transformed data. The top 1% and the zero values were removed to allow a fair comparison.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.079   3.512   3.669   3.650   3.827   4.312

The median 3.669 ($4666) and the mean 3.650 ($4467) are very close, and the distances from the median to the 1st and 3rd quartiles are similar, indicating a more symmetric, closer-to-normal distribution.
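The dollar figures and the symmetry claim follow directly from the log10 summary above. A short sketch of the arithmetic, using only the printed summary values (`log_summary`, `dollars` and `spread` are names introduced here):

```r
# Values from the log10 summary printed above.
log_summary <- c(q1 = 3.512, median = 3.669, mean = 3.650, q3 = 3.827)

# Back-transform from log10 units to dollars.
dollars <- 10^log_summary
print(round(dollars))

# Distance from the median to each quartile, in log10 units.
spread <- c(lower = log_summary[["median"]] - log_summary[["q1"]],
            upper = log_summary[["q3"]] - log_summary[["median"]])
print(spread)
```

The two distances differ by only about 0.001 log10 units, which is what supports the symmetry reading.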

Debt-to-Income Ratio

The risk of a loan also depends on the borrower's ability to repay their debt, so let's look at the Debt-to-Income Ratio.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

75% of the borrowers with a reported Debt-to-Income Ratio have a ratio of no more than 0.32 (the third quartile).

The median (0.220) is lower than the mean (0.276), and the maximum is 10.010, which implies that the distribution is skewed by upper outliers. Let's limit the x axis range to 0 to 1.

plot of chunk DebtToIncomeRatio_hist

The distribution is unimodal and positively skewed, peaking around 0.18.

Employment and Occupation

Employment Status

As employment plays an important role in determining salary level, I'm going to find out what its distribution looks like.

plot of chunk employment_status_hist

## Source: local data frame [9 x 3]
## 
##   EmploymentStatus NumOfLoans NumOfLoans.percent
## 1         Employed      67322        0.590870393
## 2        Full-time      26355        0.231312041
## 3    Self-employed       6134        0.053836769
## 4    Not available       5347        0.046929443
## 5            Other       3806        0.033404425
## 6               NA       2255        0.019791639
## 7        Part-time       1088        0.009549137
## 8     Not employed        835        0.007328611
## 9          Retired        795        0.006977540

The majority (89%) of the loans are from employed borrowers (“Employed”, “Full-time”, “Part-time”, “Self-employed”).
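The combined share of the four employed categories can be recomputed directly from the counts in the table above. The variable names are mine.

```r
# Counts copied from the EmploymentStatus table above
employed_counts <- c("Employed" = 67322, "Full-time" = 26355,
                     "Self-employed" = 6134, "Part-time" = 1088)
total_loans <- 113937  # number of observations in the dataset

employed_share <- sum(employed_counts) / total_loans
print(round(100 * employed_share, 1))
```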

But why does the ambiguous level “Employed” appear here, and what does it mean compared with Full-time / Part-time / Self-employed? It appears to me that this variable needs further interpretation; for example, we could assume that Employed is made up of Full-time, Part-time and Self-employed borrowers in the same proportions as those three levels. Let's leave it for now.

Occupation

##                            Occupation NumOfLoans NumOfLoans.percent
## 1                               Other      28617       0.2511651176
## 2                        Professional      13628       0.1196099599
## 3                 Computer Programmer       4478       0.0393024215
## 4                           Executive       4311       0.0378366992
## 5                             Teacher       3759       0.0329919166
## 6            Administrative Assistant       3688       0.0323687652
## 7                             Analyst       3602       0.0316139621
## 8                                           3588       0.0314910872
## 9                  Sales - Commission       3446       0.0302447844
## 10                     Accountant/CPA       3233       0.0283753302
## 11                           Clerical       3164       0.0277697324
## 12                     Sales - Retail       2797       0.0245486541
## 13                      Skilled Labor       2746       0.0241010383
## 14                  Retail Management       2602       0.0228371820
## 15                         Nurse (RN)       2489       0.0218454058
## 16                       Construction       1790       0.0157104365
## 17                       Truck Driver       1675       0.0147011068
## 18                            Laborer       1595       0.0139989643
## 19  Police Officer/Correction Officer       1578       0.0138497591
## 20                      Civil Service       1457       0.0127877687
## 21              Engineer - Mechanical       1406       0.0123401529
## 22                  Military Enlisted       1272       0.0111640644
## 23            Food Service Management       1239       0.0108744306
## 24              Engineer - Electrical       1125       0.0098738777
## 25                       Food Service       1123       0.0098563241
## 26                 Medical Technician       1117       0.0098036634
## 27                           Attorney       1046       0.0091805120
## 28               Tradesman - Mechanic        951       0.0083467179
## 29                      Social Worker        741       0.0065035941
## 30                     Postal Service        627       0.0055030412
## 31                          Professor        557       0.0048886665
## 32                            Realtor        543       0.0047657916
## 33                             Doctor        494       0.0043357294
## 34                        Nurse (LPN)        492       0.0043181758
## 35                       Nurse's Aide        491       0.0043093991
## 36            Tradesman - Electrician        477       0.0041865241
## 37                    Waiter/Waitress        436       0.0038266761
## 38                            Fireman        422       0.0037038012
## 39                          Scientist        372       0.0032649622
## 40                   Military Officer        346       0.0030367659
## 41                         Bus Driver        316       0.0027734625
## 42                          Principal        312       0.0027383554
## 43                     Teacher's Aide        276       0.0024223913
## 44                         Pharmacist        257       0.0022556325
## 45 Student - College Graduate Student        245       0.0021503111
## 46                        Landscaping        236       0.0020713201
## 47                Engineer - Chemical        225       0.0019747755
## 48                           Investor        214       0.0018782310
## 49                          Architect        213       0.0018694542
## 50         Pilot - Private/Commercial        199       0.0017465792
## 51                             Clergy        196       0.0017202489
## 52           Student - College Senior        188       0.0016500347
## 53                         Car Dealer        180       0.0015798204
## 54                            Chemist        145       0.0012726331
## 55                       Psychologist        145       0.0012726331
## 56                          Biologist        125       0.0010970975
## 57                          Religious        124       0.0010883207
## 58                   Flight Attendant        123       0.0010795440
## 59                          Homemaker        120       0.0010532136
## 60              Tradesman - Carpenter        120       0.0010532136
## 61           Student - College Junior        112       0.0009829994
## 62                Tradesman - Plumber        102       0.0008952316
## 63        Student - College Sophomore         69       0.0006055978
## 64                            Dentist         68       0.0005968211
## 65         Student - College Freshman         41       0.0003598480
## 66        Student - Community College         28       0.0002457498
## 67                              Judge         22       0.0001930892
## 68         Student - Technical School         16       0.0001404285

25% of the loans are from “Other”, and the remaining occupations form a long tail. The ambiguous occupation “Professional” accounts for 12% of loans and ranks second. Another 3% of values are missing. These three values are not very meaningful.

Credit History

Borrowers may already have existing loans on Prosper. It is possible to evaluate the risk by looking at the historical data.

Total Prosper Loans

Let's start by looking at Total Prosper Loans.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.00    1.00    1.42    2.00    8.00   91852

The summary shows 91852 NA values, about 81% of all observations. This means the majority of the borrowers are first-time borrowers on Prosper. The minimum of TotalProsperLoans is 0, which should be equivalent to NA. Let's see its distribution.

## Source: local data frame [10 x 3]
## 
##    TotalProsperLoans NumOfLoans NumOfLoans.percent
## 1                 NA      91852       8.061648e-01
## 2                  1      15538       1.363736e-01
## 3                  2       4540       3.984658e-02
## 4                  3       1447       1.270000e-02
## 5                  4        417       3.659917e-03
## 6                  5        104       9.127851e-04
## 7                  6         29       2.545266e-04
## 8                  7          8       7.021424e-05
## 9                  0          1       8.776780e-06
## 10                 8          1       8.776780e-06

The value 0 appears for only 1 loan, which is probably a data-entry error. The value 8 also appears for only 1 loan and seems to be an outlier. Let's remove these and plot the histogram.

plot of chunk total_prosper_loans_hist

The number of loans appears to decrease exponentially as TotalProsperLoans increases. I am going to put the y axis on a log scale to check this relationship.

plot of chunk total_prosper_loans_hist2

## 
##  Pearson's product-moment correlation
## 
## data:  df.numByTotalProsperLoans.subset$TotalProsperLoans and log10(df.numByTotalProsperLoans.subset$NumOfLoans)
## t = -88.357, df = 5, p-value = 3.52e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9999549 -0.9977300
## sample estimates:
##        cor 
## -0.9996799

The plot and the correlation test above (corr = -0.9997 between TotalProsperLoans and log10 of the number of loans) confirm the negative exponential relationship between the number of loans and TotalProsperLoans.
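The test output above can be reproduced from the summarized counts alone, assuming (as the 5 degrees of freedom suggest) that the subset contains the seven values 1 to 7. The fitted line is an extra illustration and is not part of the original output.

```r
# Counts copied from the TotalProsperLoans table, values 1 to 7 only
total_prosper_loans <- 1:7
num_of_loans <- c(15538, 4540, 1447, 417, 104, 29, 8)

# Pearson correlation between the loan count index and log10 of the counts
result <- cor.test(total_prosper_loans, log10(num_of_loans))
print(result$estimate)

# A straight line on the log scale corresponds to exponential decay
fit <- lm(log10(num_of_loans) ~ total_prosper_loans)
print(coef(fit))
```

A slope of about -0.5 on the log10 scale would mean that each additional prior loan divides the number of borrowers by roughly three.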

Total Prosper Payments Billed

Now take a look at the Total Prosper Payments Billed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   22.93   33.00  141.00   91852

The median is 16 and the mean is 22.93. The maximum is 141. What does its distribution look like?

plot of chunk TotalProsperPaymentsBilled_hist

The number of loans seems to peak around 16, with a prominent spike around 35. I'd like to set the binwidth to 1 to see a finer picture.

plot of chunk TotalProsperPaymentsBilled_hist2

The plot above shows that the distribution of payments billed is quite strange. It seems to consist of several exponential-like decreasing segments. I'm going to find the cut points between these segments. Let's preview the first 50 rows of the summarized data.

##    TotalProsperPaymentsBilled NumOfLoans NumOfLoans.percent
## 1                           0         65       0.0005704907
## 2                           1        422       0.0037038012
## 3                           2        308       0.0027032483
## 4                           3        281       0.0024662752
## 5                           4        257       0.0022556325
## 6                           5        253       0.0022205254
## 7                           6       1219       0.0106988950
## 8                           7        780       0.0068458885
## 9                           8        643       0.0056434696
## 10                          9       1633       0.0143324820
## 11                         10       1144       0.0100406365
## 12                         11       1041       0.0091366281
## 13                         12        822       0.0072145133
## 14                         13        739       0.0064860405
## 15                         14        664       0.0058277820
## 16                         15        558       0.0048974433
## 17                         16        532       0.0046692470
## 18                         17        486       0.0042655152
## 19                         18        449       0.0039407743
## 20                         19        418       0.0036686941
## 21                         20        422       0.0037038012
## 22                         21        382       0.0033527300
## 23                         22        323       0.0028349000
## 24                         23        326       0.0028612303
## 25                         24        303       0.0026593644
## 26                         25        265       0.0023258467
## 27                         26        233       0.0020449898
## 28                         27        263       0.0023082932
## 29                         28        225       0.0019747755
## 30                         29        226       0.0019835523
## 31                         30        207       0.0018167935
## 32                         31        232       0.0020362130
## 33                         32        249       0.0021854183
## 34                         33        254       0.0022293022
## 35                         34        267       0.0023434003
## 36                         35       1069       0.0093823780
## 37                         36        297       0.0026067037
## 38                         37        112       0.0009829994
## 39                         38        116       0.0010181065
## 40                         39        107       0.0009391155
## 41                         40         98       0.0008601245
## 42                         41        103       0.0009040084
## 43                         42         90       0.0007899102
## 44                         43        110       0.0009654458
## 45                         44        148       0.0012989635
## 46                         45        116       0.0010181065
## 47                         46        118       0.0010356601
## 48                         47        113       0.0009917762
## 49                         48        140       0.0012287492
## 50                         49        125       0.0010970975

The cut points, or “jumping” points, where a new exponentially decreasing segment begins are 1, 6, 9, and 35.
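The jump points can also be picked out programmatically. The sketch below flags every value whose count is more than 1.5 times the count of the preceding value. The 1.5 threshold is an arbitrary choice of mine, not something taken from the original analysis.

```r
# NumOfLoans for TotalProsperPaymentsBilled = 0..49, copied from the table above
payments_billed <- 0:49
num_of_loans <- c(
    65,  422,  308,  281,  257,  253, 1219,  780,  643, 1633,
  1144, 1041,  822,  739,  664,  558,  532,  486,  449,  418,
   422,  382,  323,  326,  303,  265,  233,  263,  225,  226,
   207,  232,  249,  254,  267, 1069,  297,  112,  116,  107,
    98,  103,   90,  110,  148,  116,  118,  113,  140,  125
)

# Ratio of each count to the count of the previous value
ratio <- num_of_loans[-1] / num_of_loans[-length(num_of_loans)]

# Assumed rule: a jump is a count more than 1.5x its predecessor
jump_points <- payments_billed[-1][ratio > 1.5]
print(jump_points)  # 1 6 9 35 for the 50 rows above
```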

I wonder why it looks like this, and I suspect that a few categorical variables control the distribution.

Let's take a look at the loans with TotalProsperPaymentsBilled at 35 specifically.

##         A   AA    B    C    D    E   HR   NC 
## 1069    0    0    0    0    0    0    0    0

It appears that none of the 1069 loans with TotalProsperPaymentsBilled at 35 has a CreditGrade.

Other Credit Historical Variables

Other credit history variables include: CreditScoreRangeLower, CreditScoreRangeUpper, CurrentCreditLines, OpenCreditLines, TotalCreditLinespast7years, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment, InquiriesLast6Months, TotalInquiries, CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years.

As there are so many of them, I am going to show these 12 variables in one figure, followed by a summary of each.

(In all sub-plots the top and bottom 1% of values are removed to get rid of outliers. I also applied a log scale to the y axis for CurrentDelinquencies, AmountDelinquent and DelinquenciesLast7Years, as they appear to follow negative exponential distributions.)

plot of chunk CreditHistory_hist

## [1] "summary of CreditScoreRangeLower"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591 
## [1] ""
## [1] "summary of CreditScoreRangeUpper"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.6   739.0   899.0     591 
## [1] ""
## [1] "summary of CurrentCreditLines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   10.00   10.32   13.00   59.00    7604 
## [1] ""
## [1] "summary of OpenCreditLines"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    9.26   12.00   54.00    7604 
## [1] ""
## [1] "summary of TotalCreditLinespast7years"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   17.00   25.00   26.75   35.00  136.00     697 
## [1] ""
## [1] "summary of OpenRevolvingAccounts"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    6.00    6.97    9.00   51.00 
## [1] ""
## [1] "summary of OpenRevolvingMonthlyPayment"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   114.0   271.0   398.3   525.0 14980.0 
## [1] ""
## [1] "summary of InquiriesLast6Months"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.000   1.435   2.000 105.000     697 
## [1] ""
## [1] "summary of TotalInquiries"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159 
## [1] ""
## [1] "summary of CurrentDelinquencies"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.5921  0.0000 83.0000     697 
## [1] ""
## [1] "summary of AmountDelinquent"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    984.5      0.0 463900.0     7622 
## [1] ""
## [1] "summary of DelinquenciesLast7Years"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990 
## [1] ""

The distributions of CreditScoreRangeLower and CreditScoreRangeUpper have exactly the same shape, shifted by 19 points. Both are negatively skewed, with medians of 680 and 699 respectively.

CurrentCreditLines, OpenCreditLines, TotalCreditLinespast7years and OpenRevolvingAccounts appear to be roughly normally distributed, with mean values of 10.32, 9.26, 26.75 and 6.97, respectively.

OpenRevolvingMonthlyPayment looks like a half-normal distribution peaking at 0. Its median is 271 and its mean is 398.3 for the whole population.

The distribution of InquiriesLast6Months seems to be negative exponential, with a median of 1. TotalInquiries appears to follow a long-tailed distribution.

CurrentDelinquencies, AmountDelinquent and DelinquenciesLast7Years all look like negative exponential distributions.

Date

Now I am going to take a look at the loan origination date, because I am interested in how the number of loans changes over time.

Let's convert the date variable to Date type and aggregate by week (binwidth = 7).
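As a rough illustration of that conversion, the sketch below parses timestamp strings in the format seen in the dataset's date columns and counts loans per calendar week. It uses cut() on the dates instead of a histogram binwidth, which is my substitution, and the four timestamps are toy values.

```r
# Toy timestamps in the textual format used by the dataset's date columns
raw_dates <- c("2013-10-02 17:20:16.550000000",
               "2013-10-03 09:00:00.000000000",
               "2013-10-09 11:44:58.283000000",
               "2013-12-06 05:43:13.830000000")

# Keep only the yyyy-mm-dd part and convert it to Date
loan_dates <- as.Date(substr(raw_dates, 1, 10))

# Assign each date to its calendar week and count the loans per week
week_start <- cut(loan_dates, breaks = "week")
loans_per_week <- table(week_start)
print(loans_per_week[loans_per_week > 0])
```

Weeks with no loans are kept as zero counts in loans_per_week, which is what a time series plot needs to show gaps.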

plot of chunk LoanDate_hist

The time series plot shows that the number of loans increased from 2006 and dropped to 0 from the end of 2008 (presumably due to the financial crisis). It started to recover at the end of 2009 and jumped significantly at the beginning of 2013.

Loan Status

Finally, let's check the loan status and what its distribution looks like.

plot of chunk loan_status_hist

## Source: local data frame [12 x 3]
## 
##                LoanStatus NumOfLoans NumOfLoans.percent
## 1                 Current      56576       0.4965551138
## 2               Completed      38074       0.3341671274
## 3              Chargedoff      11992       0.1052511476
## 4               Defaulted       5018       0.0440418828
## 5    Past Due (1-15 days)        806       0.0070740848
## 6   Past Due (31-60 days)        363       0.0031859712
## 7   Past Due (61-90 days)        313       0.0027471322
## 8  Past Due (91-120 days)        304       0.0026681412
## 9   Past Due (16-30 days)        265       0.0023258467
## 10 FinalPaymentInProgress        205       0.0017992399
## 11   Past Due (>120 days)         16       0.0001404285
## 12              Cancelled          5       0.0000438839

About half of the loans (50%) are Current and 33% are Completed. Defaulted loans make up 4.4% and charged-off loans 10.5%.

Univariate Analysis

What is the structure of your dataset?

The dataset contains a total of 113,937 observations with 81 variables.

Interesting categorical variables:

What is/are the main feature(s) of interest in your dataset?

The most interesting features are ProsperScore, EstimatedReturn, BorrowerAPR and LoanOriginalAmount, because I'd like to create a predictive model of ProsperScore that informs lenders' loan decisions. I believe EstimatedReturn, BorrowerAPR and LoanOriginalAmount can be important indicators of ProsperScore.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think three categories of variables will help predict the model I would like to create.

Did you create any new variables from existing variables in the dataset?

I created 4 new variables that convert the date-related factor variables (ListingCreationDate, ClosedDate, DateCreditPulled, FirstRecordedCreditLine) into Date type. I needed Date-typed variables to draw time series plots such as the one for LoanOriginationDate.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

In the histogram of TotalProsperLoans, I put the y axis on a log scale, which presents the negative exponential relationship better. The x axis of the StatedMonthlyIncome histogram was put on a log scale to bring out its approximately normal shape.

I also adjusted the bin width for a few histograms, including LoanOriginalAmount, EstimatedReturn, EstimatedLoss and BorrowerAPR. I did so because the default binwidth from qplot is too large to show some of the finer patterns (such as the spikes at multiples of 5000 in LoanOriginalAmount).
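The effect of bin width can be seen without drawing anything. The sketch below uses base R's hist() with plot = FALSE as a stand-in for qplot, on toy loan amounts that I made up to have extra loans at round amounts.

```r
# Toy loan amounts: a uniform background plus extra loans at round amounts
loan_amount <- c(seq(1000, 20000, by = 250),
                 rep(5000, 40), rep(10000, 30), rep(15000, 20))

# Same data, binned with a wide and a narrow bin width
coarse <- hist(loan_amount, breaks = seq(0, 20000, by = 5000), plot = FALSE)
fine <- hist(loan_amount, breaks = seq(0, 20000, by = 500), plot = FALSE)

print(coarse$counts)

# Upper edges of the narrow bins that stand out from the rest
spike_edges <- fine$breaks[-1][fine$counts > 10]
print(spike_edges)
```

With the wide bins the round amounts blend into their neighbours, while the narrow bins isolate them as separate spikes.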

Bivariate Plots Section

Correlation Matrix

To find the variables most correlated with the main variables of interest, I am going to create a correlation matrix plot first.

The correlation matrix shows that ProsperScore has a strong negative correlation with BorrowerAPR (-0.67), LenderYield (-0.65), EstimatedEffectiveYield (-0.63) and EstimatedLoss (-0.67), and a weaker negative correlation with EstimatedReturn (-0.38). BorrowerAPR and LenderYield are almost perfectly correlated with each other (0.99). ProsperScore has a moderate positive correlation with CreditScoreRangeLower (0.37) and CreditScoreRangeUpper (0.37); these two are perfectly correlated with each other (1). ProsperScore also has weak correlations with DebtToIncomeRatio (-0.13) and StatedMonthlyIncome (0.08).
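For readers who want to build a matrix like this, here is a minimal sketch on synthetic data. The three columns only borrow their names from the dataset; the values, the noise levels and the use of pairwise-complete observations for missing values are all my assumptions.

```r
set.seed(42)
n <- 200

# Synthetic stand-ins for three of the dataset's columns
toy <- data.frame(BorrowerAPR = runif(n, 0.05, 0.36))
toy$LenderYield <- toy$BorrowerAPR - 0.01 + rnorm(n, sd = 0.002)
toy$ProsperScore <- round(pmin(pmax(11 - 25 * toy$BorrowerAPR + rnorm(n), 1), 10))

# Some scores are missing, as in the real data
toy$ProsperScore[1:10] <- NA

# Pairwise-complete observations keep every row that has both values present
corr_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(corr_matrix, 2))
```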

Prosper Score vs Interest-Related Variables

We should look closely at the relationship between ProsperScore and the interest-related variables (Borrower APR, Lender Yield, Estimated Effective Yield, Estimated Loss, Estimated Return).

A quick preview of these relationships is plotted below.

plot of chunk PS_v_Income1

The boxplots show that the median ProsperScore decreases as BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss and EstimatedReturn increase. This makes sense, as lenders expect a higher return for taking a higher risk (lower ProsperScore).

Prosper Score vs Borrower APR

Now let's take a closer look at the relationship between Prosper Score and Borrower APR.

plot of chunk PS_v_BorrowerAPR_scatter

Overall, BorrowerAPR shows a decreasing trend as ProsperScore increases. The correlation coefficient is -0.668, indicating a moderately strong negative correlation.

Meanwhile, there are a few dense areas of BorrowerAPR around 0.36 for ProsperScore between 2 and 5. What are those loans, and why do they stay close to 0.36? I suspect that removing those loans would result in a stronger correlation.

I'd like to plot a histogram for each ProsperScore to highlight this pattern.

plot of chunk PS_v_BorrowerAPR_hist

The plots above confirm that there are quite a lot of loans (ProsperScore <= 5) with BorrowerAPR around 0.35. What are they? Also, for loans with ProsperScore of 6-7, the BorrowerAPR distribution appears to be bimodal or multimodal, peaking at 0.2 and 0.3.

I'm going to investigate a little more into those odd loans. Let's subset the data by limiting BorrowerAPR between 0.35 and 0.37 and ProsperScore between 2 and 5, and summarize it.

## Source: local data frame [69 x 3]
## 
##    BorrowerAPR NumOfLoans NumOfLoans.percent
## 1      0.35797       3410        0.515105740
## 2      0.35643       1531        0.231268882
## 3      0.35356        656        0.099093656
## 4      0.35132        254        0.038368580
## 5      0.35285        214        0.032326284
## 6      0.35838        117        0.017673716
## 7      0.35843        101        0.015256798
## 8      0.35244         69        0.010422961
## 9      0.35858         66        0.009969789
## 10     0.35090         50        0.007552870
## ..         ...        ...                ...

There are 3410 loans with BorrowerAPR at 0.35797 and 1531 loans at 0.35643.
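The subset-and-count step described above could be sketched as follows. The toy data frame is made up, and treating both ranges as inclusive is an assumption, since the original filter is not shown.

```r
# Toy loans (not from the dataset); the last row has a missing ProsperScore
loans <- data.frame(
  BorrowerAPR  = c(0.35797, 0.35797, 0.35643, 0.20000, 0.35797, 0.35356, 0.36500),
  ProsperScore = c(3, 4, 2, 5, 8, 5, NA)
)

# Assumed inclusive bounds; subset() drops rows where the condition is NA
odd <- subset(loans, BorrowerAPR >= 0.35 & BorrowerAPR <= 0.37 &
                     ProsperScore >= 2 & ProsperScore <= 5)

# Count loans per distinct APR, most frequent first, with their shares
apr_counts <- sort(table(odd$BorrowerAPR), decreasing = TRUE)
print(apr_counts)
print(round(prop.table(apr_counts), 3))
```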

After a quick check through all variables for the odd loans, I noticed that the variable ProsperRating..Alpha. is quite unusual in this subset: it contains only “E” and “HR” rather than all levels (“AA”, “A”, “B”, “C”, “D”, “E”, “HR”). (The code has been commented out because the output is too large; you can uncomment it in the R block above.)

Let's plot a faceted histogram and a colored density plot of BorrowerAPR by ProsperRating..Alpha. using the full dataset.

plot of chunk PS_v_BorrowerAPR_oddity4 plot of chunk PS_v_BorrowerAPR_oddity4

According to the plots above, it is clear that Borrower APR is highly dependent on Prosper Rating, as each Prosper Rating corresponds to a certain range of BorrowerAPR (shown below).

ProsperRating..Alpha.  ProsperRating..numeric.  BorrowerAPR Range
AA                     7                        0.05 - 0.10
A                      6                        0.10 - 0.16
B                      5                        0.16 - 0.21
C                      4                        0.21 - 0.26
D                      3                        0.21 - 0.26
E                      2                        0.26 - 0.31
HR                     1                        0.31 - 0.36

Fortunately there is a numeric variable, ProsperRating..numeric., so we can draw a scatter plot of Borrower APR against the numeric Prosper Rating.

plot of chunk PR_v_BorrowerAPR

The scatter plot above presents a strong negative correlation (-0.962). This relationship is to be expected according to Prosper's Explanation on Personal Loan Rates and Fees. It also explains the earlier oddity of prominent spikes in BorrowerAPR around 0.36: the majority of those loans are rated “HR”, the lowest Prosper Rating and the highest risk.
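If only the letter rating were available, a numeric version could be derived from an ordered factor. The sketch below assumes the ordering HR < E < D < C < B < A < AA, mapped to 1 through 7, and uses toy APR values that I chose to fall roughly inside the ranges listed above.

```r
# Assumed rating order from highest risk (1) to lowest risk (7)
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")

# Toy loans, not taken from the dataset
toy <- data.frame(
  ProsperRating..Alpha. = c("HR", "E", "D", "C", "B", "A", "AA", "HR", "AA"),
  BorrowerAPR = c(0.358, 0.30, 0.25, 0.22, 0.18, 0.13, 0.08, 0.356, 0.07)
)

# The integer codes of the factor follow the order given in rating_levels
rating_numeric <- as.integer(factor(toy$ProsperRating..Alpha.,
                                    levels = rating_levels))
print(rating_numeric)
print(cor(rating_numeric, toy$BorrowerAPR))
```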

It is interesting to notice that there are quite a lot of loans in the HR tier, which leads to another question: is this the result of an intentional profit motive among investors (higher risk = higher return)?

Let's then see when those “HR” loans were originated.

plot of chunk PS_v_BorrowerAPR_oddity2

There are many “HR” loans in 2013, with a prominent spike during the 4th quarter. There is also a gap at the beginning of 2012. I suspect the spike is a result of speculation by some high-risk investors.

Prosper Score vs Estimated Return

plot of chunk PS_v_EstimatedReturn_scatter

The relationship between ProsperScore and EstimatedReturn appears similar to the one between ProsperScore and BorrowerAPR, but the correlation is weaker, only -0.38.

Prosper Score vs Income Range

I am plotting a boxplot of ProsperScore against IncomeRange in order to see their relationship.

plot of chunk PS_v_IncomeRange

Borrowers with an IncomeRange at or below $50K (including “Not employed”) have a median ProsperScore of 5. The $50K-$100K ranges have a median of 6, and the over-$100K range has a median of 7.

Prosper Score vs Stated Monthly Income

Let's be more precise about income by using the numeric variable StatedMonthlyIncome.

plot of chunk PS_v_StatedMonthlyIncome_box

The boxplot shows that the median StatedMonthlyIncome tends to be higher as ProsperScore increases. There is one oddity: ProsperScore 1 has a higher median income than Score 2 and even Score 5. This could be because some people lied about their income, which led to a low credit rating and thus a low ProsperScore. The median StatedMonthlyIncome appears to grow roughly exponentially with ProsperScore.

I am going to create a scatter plot in order to evaluate the relationship between StatedMonthlyIncome and ProsperScore.

plot of chunk PS_v_StatedMonthlyIncome_scatter

The scatter plot shows a positive correlation between StatedMonthlyIncome and ProsperScore, but the correlation is weak (0.203).

Prosper Score vs Debt-to-Income Ratio

Let's evaluate the relationship between the debt-to-income ratio and Prosper Score.

plot of chunk PS_v_DebtToIncomeRatio_box

In contrast to StatedMonthlyIncome, the median DebtToIncomeRatio generally decreases as ProsperScore increases. Let's see its actual correlation via a scatter plot.

plot of chunk PS_v_DebtToIncomeRatio_scatter

DebtToIncomeRatio is negatively correlated with ProsperScore, with a correlation coefficient of -0.282, which is quite weak.

Prosper Score vs Credit Score Range Lower/Upper

From the earlier correlation matrix, another variable with a notable correlation with ProsperScore is CreditScoreRangeLower/Upper. The correlation between CreditScoreRangeLower and CreditScoreRangeUpper is 1, i.e. they are perfectly correlated, so we only need one of them. Let's plot CreditScoreRangeLower against ProsperScore in a scatter plot.

plot of chunk PS_v_CreditScoreRangeLower_scatter

ProsperScore is positively correlated with CreditScoreRangeLower, with a correlation coefficient of 0.370.

Original Loan Amount

I am going to look at how ProsperScore is correlated with LoanOriginalAmount, as I suspect lenders would be more cautious about lending larger amounts at higher risk (lower Prosper Score).

plot of chunk PS_v_LoanOriginalAmount_box

In general, the median LoanOriginalAmount increases as ProsperScore increases. The correlation is still quite weak, at 0.267.

Borrower APR vs Original Loan Amount

I wonder if BorrowerAPR, which is highly correlated to Prosper Score and Prosper Rating, has any relationship with LoanOriginalAmount.

Let's draw their scatter plot.

plot of chunk BorrowerAPR_v_LoanOriginalAmount_scatter

There is a negative correlation between BorrowerAPR and LoanOriginalAmount, with correlation coefficient equal to -0.323.

It is also noticeable that as LoanOriginalAmount increases, the range of BorrowerAPR narrows from 0.05-0.40 at $1K to 0.10-0.20 at $35K.

plot of chunk cut_LoanOriginalAmount_trans

The faceted histograms above confirm this pattern. The interquartile range narrows from 0.15 to only 0.03 as the original loan amount goes from the lowest bucket to the highest. This indicates that there are fewer APR options when quoting larger loans.
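The interquartile range per loan-amount bucket can be computed with cut() and tapply(). The sketch below runs on synthetic data whose APR spread is deliberately made to shrink with the amount, so it only illustrates the calculation and says nothing about the real loans. The bucket boundaries are also my choice.

```r
set.seed(7)
n <- 600

# Synthetic loan amounts between $1K and $35K
amount <- sample(seq(1000, 35000, by = 500), n, replace = TRUE)

# Synthetic APRs whose spread shrinks as the amount grows (assumed, so that
# the toy data mimics the pattern described above)
apr <- 0.20 + rnorm(n, sd = 0.10 * (1 - amount / 40000))

# Bucket the amounts (assumed boundaries) and compute the IQR per bucket
bucket <- cut(amount, breaks = c(0, 5000, 15000, 25000, 35000))
iqr_by_bucket <- tapply(apr, bucket, IQR)
print(round(iqr_by_bucket, 3))
```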

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As described in the Bivariate Plots Section, ProsperScore has a strong negative correlation with BorrowerAPR, LenderYield, EstimatedEffectiveYield and EstimatedLoss, and a weaker negative correlation with EstimatedReturn.

ProsperScore has a moderate positive correlation with CreditScoreRangeLower and CreditScoreRangeUpper (these two are perfectly correlated with each other). It has a weak correlation with DebtToIncomeRatio (negative) and StatedMonthlyIncome (positive).

Also, as LoanOriginalAmount increases, the range of BorrowerAPR narrows, which means the options for larger loans are limited.

By investigating the distributions of BorrowerAPR by ProsperScore, I found quite a lot of loans with a BorrowerAPR of 0.35797 or 0.35643 within the ProsperScore range of 2-5. Further investigation shows that these loans mostly have a Prosper Rating of “HR”. This also led me to notice that BorrowerAPR is strongly correlated with ProsperRating..Alpha. and ProsperRating..numeric., with a correlation coefficient of -0.962.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

CreditScoreRangeLower and CreditScoreRangeUpper are correlated with each other (corr. = 1). BorrowerAPR and LenderYield are correlated to each other (corr. = 0.99). BorrowerAPR is also highly correlated with ProsperRating..numeric..

What was the strongest relationship you found?

ProsperScore is most strongly correlated with the interest-related variables, particularly BorrowerAPR. It has some correlation with CreditScoreRangeLower and CreditScoreRangeUpper (0.37), but this is weaker than its correlation with BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss and EstimatedReturn.

Multivariate Plots Section

Prosper Score vs Borrower APR by Prosper Rating

As the variable with the strongest correlation with ProsperScore, BorrowerAPR should be further investigated.

As Borrower APR is determined by Prosper Rating, let's re-plot the ProsperScore vs BorrowerAPR scatter plot and color the points by ProsperRating..Alpha..

plot of chunk PS_v_BorrowerAPR_by_PR_scatter

The scatter plot above shows that Borrower APR decreases as Prosper Rating improves. As Prosper Score increases, Borrower APR appears to shift from the higher range to the lower range, and more loans with higher Prosper Ratings appear. Most “HR” loans seem to be concentrated at Prosper Scores from 1 to 5.

It also seems that as Prosper Rating gets higher, the correlation between BorrowerAPR and ProsperScore changes from positive to negative. Let's plot the fitted linear model line and the correlation coefficient for each Prosper Rating tier.

plot of chunk PS_v_BorrowerAPR_by_PR_scatter2

## [1] "Corr (AA): -0.369"
## [1] "Corr (A): -0.206"
## [1] "Corr (B): -0.062"
## [1] "Corr (D): 0.346"
## [1] "Corr (E): 0.271"
## [1] "Corr (HR): 0.076"

According to the scatter plot and the summary above, the correlation is negative when Prosper Rating is “B” or higher, and the absolute slope appears to grow as Prosper Rating increases. The correlation is nearly 0 for “HR” and “B”.
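
The per-tier correlations and slopes can be computed by splitting the data on the rating and fitting each group separately. The sketch below is only an illustration in base R: the `toy` data frame, its three tiers and the slopes used to simulate it are hypothetical stand-ins, not Prosper data and not necessarily how the figures above were produced.

```r
# Simulated stand-in for the loan table (NOT Prosper data).
set.seed(42)
n <- 300
toy <- data.frame(
  ProsperRating = rep(c("AA", "D", "HR"), each = n / 3),
  ProsperScore  = sample(1:11, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical per-tier APR level and slope, chosen only for illustration.
base  <- unname(c(AA = 0.10, D = 0.25, HR = 0.36)[toy$ProsperRating])
slope <- unname(c(AA = -0.004, D = 0.004, HR = 0)[toy$ProsperRating])
toy$BorrowerAPR <- base + slope * toy$ProsperScore + rnorm(n, sd = 0.01)

# Correlation and fitted slope of BorrowerAPR on ProsperScore within each tier.
by_tier <- lapply(split(toy, toy$ProsperRating), function(d) {
  c(corr  = cor(d$ProsperScore, d$BorrowerAPR),
    slope = unname(coef(lm(BorrowerAPR ~ ProsperScore, data = d))[2]))
})
tier_stats <- do.call(rbind, by_tier)
print(round(tier_stats, 3))
```

Within a tier the sign of the correlation always matches the sign of the fitted slope, but a larger absolute correlation does not by itself imply a steeper slope, so it helps to report both.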

The Prosper Score also has some correlation with Credit Score (0.37). Therefore, let's facet the above scatter plot by CreditScoreRangeLower.bucket (cut CreditScoreRangeLower by 50 from 600 to 800).

plot of chunk PS_v_BorrowerAPR_by_PR_scatter3

The faceted scatter plot shows that when Credit Score increases, loans with higher Prosper Rating are more likely to appear as the center of the area shifts from the top to the bottom. It is also interesting to notice the gap area between BorrowerAPR of 0.3-0.35 for CreditScoreRangeLower over 700.
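
The bucketing used for the facets can be sketched with base R's `cut()`. The `scores` vector is illustrative only, and the choice of left-closed intervals (`right = FALSE`) is an assumption; the report does not say which side of each interval is closed.

```r
# Illustrative lower credit-score bounds (not taken from the dataset).
scores <- c(600, 640, 650, 699, 700, 760, 800)

# Buckets of width 50 from 600 to 800. With right = FALSE and
# include.lowest = TRUE the intervals are [600,650), [650,700),
# [700,750) and [750,800].
bucket <- cut(scores,
              breaks = seq(600, 800, by = 50),
              right = FALSE,
              include.lowest = TRUE)

print(table(bucket))
```

Values below 600 or above 800 fall outside every interval and become `NA`, so loans in those ranges would drop out of a plot faceted on this variable.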

I am going to create a density plot of BorrowerAPR colored by Prosper Rating and faceted by Prosper Score in order to find out whether there is any variability among them.

plot of chunk PS_v_BorrowerAPR_by_PR_hist

The plot above shows how the distribution of BorrowerAPR changes with Prosper Score. There are hardly any “HR” loans with a Prosper Score over 7. As Prosper Score increases, the overall distribution of Prosper Rating shifts from low to high and Borrower APR shifts from high to low. This explains the negative correlation between BorrowerAPR and ProsperScore stated previously.

Prosper Score vs Estimated Return by Prosper Rating

I am interested to know how Prosper Score and Prosper Rating affect the Estimated Return. I am going to re-plot EstimatedReturn vs ProsperScore, colored by ProsperRating..Alpha..

plot of chunk PS_v_EstimatedReturn_by_PR_scatter

The first thing I noticed is that there are quite a few points with negative return. Almost all of those points have a Prosper Rating of “HR”, the lowest rating. The plot is somewhat overplotted, so let's adjust the alpha and re-plot it, and add the fitted linear model line for each Prosper Rating tier.

plot of chunk PS_v_EstimatedReturn_by_PR_scatter2

## [1] "Corr (AA): -0.264"
## [1] "Corr (A): -0.019"
## [1] "Corr (B): 0.08"
## [1] "Corr (D): 0.454"
## [1] "Corr (E): 0.318"
## [1] "Corr (HR): 0.271"

Again, in general, the top Prosper Rating (“AA”) is in the bottom right and the worst rating (“HR”) is in the top left. The correlation between EstimatedReturn and ProsperScore moves from negative to positive as ProsperRating becomes lower. “AA” loans have a negative correlation between EstimatedReturn and ProsperScore, meaning it is not always a good choice for investors to pick “AA” loans with a high Prosper Score, because their estimated returns are lower.

Prosper Score vs Estimated Return by Credit Score Range and Prosper Rating

To add Credit Score Range as another dimension, I am going to plot Prosper Score vs Estimated Return colored by Prosper Rating and faceted by Credit Score Range.

plot of chunk PS_v_EstimatedReturn_CSR_PR

The faceted, colored scatter plot extends the previous scatter plot. EstimatedReturn and ProsperScore remain negatively correlated with each other across the levels of CreditScoreRangeLower.bucket.

Prosper Score vs Estimated Return by Income Range

I am also curious to know if income has anything to do with Prosper Score.

plot of chunk PS_v_EstimatedReturn_IncomeRange

## [1] "Corr ($0): -0.072"
## [1] "Corr ($1-24,999): -0.192"
## [1] "Corr ($25,000-49,999): -0.241"
## [1] "Corr ($50,000-74,999): -0.369"
## [1] "Corr ($75,000-99,999): -0.438"
## [1] "Corr ($100,000+): -0.482"

The scatter plot confirms the negative correlation between ProsperScore and BorrowerAPR. The correlation becomes stronger (more negative) as the income range increases, from -0.072 for $0 to -0.482 for $100,000+.

Borrower APR vs Estimated Effective Yield by Prosper Score

As we know, the Estimated Effective Yield is calculated from Borrower APR. I wonder how the two vary with Prosper Score.

plot of chunk PS_v_EstimatedEffectiveYield

The joint distribution of BorrowerAPR and EstimatedEffectiveYield clearly consists of multiple straight lines. This cannot be explained by Prosper Score. Based on the definitions of the variables, I suspect it is due to different tiers of servicing fees or charge-off fees.

Prediction Model of Prosper Score

Now let's create a linear model based on previous findings to predict the ProsperScore.

## 
## Calls:
## m1: lm(formula = ProsperScore ~ BorrowerAPR, data = df)
## m2: lm(formula = ProsperScore ~ BorrowerAPR + ProsperRating..numeric., 
##     data = df)
## m3: lm(formula = ProsperScore ~ BorrowerAPR + ProsperRating..numeric. + 
##     CreditScoreRangeLower, data = df)
## m4: lm(formula = ProsperScore ~ BorrowerAPR + ProsperRating..numeric. + 
##     CreditScoreRangeLower + EstimatedEffectiveYield + EstimatedLoss + 
##     EstimatedReturn, data = df)
## m5: lm(formula = ProsperScore ~ BorrowerAPR + ProsperRating..numeric. + 
##     CreditScoreRangeLower + EstimatedEffectiveYield + EstimatedLoss + 
##     EstimatedReturn + IncomeRange, data = df)
## 
## =====================================================================================================
##                                               m1          m2          m3          m4          m5     
## -----------------------------------------------------------------------------------------------------
## (Intercept)                                10.454***   0.174       0.982***     6.921***    7.928*** 
##                                            (0.018)    (0.111)     (0.144)      (0.155)     (0.170)   
## BorrowerAPR                               -19.873***   4.101***    4.124***   -44.556***  -44.557*** 
##                                            (0.076)    (0.265)     (0.265)      (0.664)     (0.661)   
## ProsperRating..numeric.                                1.190***    1.211***     0.721***    0.704*** 
##                                                       (0.013)     (0.013)      (0.016)     (0.016)   
## CreditScoreRangeLower                                             -0.001***    -0.003***   -0.003*** 
##                                                                   (0.000)      (0.000)     (0.000)   
## EstimatedEffectiveYield                                                       -10.004***  -10.129*** 
##                                                                                (0.195)     (0.194)   
## EstimatedLoss                                                                  49.989***   49.746*** 
##                                                                                (0.745)     (0.743)   
## EstimatedReturn                                                                60.372***   60.534*** 
##                                                                                (0.599)     (0.597)   
## IncomeRange: $0/Not employed                                                               -0.909*** 
##                                                                                            (0.241)   
## IncomeRange: $1-24,999/Not employed                                                        -0.704*** 
##                                                                                            (0.066)   
## IncomeRange: $25,000-49,999/Not employed                                                   -0.886*** 
##                                                                                            (0.063)   
## IncomeRange: $50,000-74,999/Not employed                                                   -0.803*** 
##                                                                                            (0.063)   
## IncomeRange: $75,000-99,999/Not employed                                                   -0.674*** 
##                                                                                            (0.063)   
## IncomeRange: $100,000+/Not employed                                                        -0.519*** 
##                                                                                            (0.063)   
## -----------------------------------------------------------------------------------------------------
## R-squared                                       0.447       0.499       0.499       0.562       0.566
## adj. R-squared                                  0.447       0.499       0.499       0.562       0.566
## sigma                                           1.768       1.683       1.682       1.572       1.566
## F                                           68477.863   42213.383   28192.901   18166.277    9217.843
## p                                               0.000       0.000       0.000       0.000       0.000
## Log-likelihood                            -168748.663 -164550.061 -164511.748 -158798.552 -158444.527
## Deviance                                   265198.520  240210.805  239993.981  209757.651  208014.627
## AIC                                        337503.326  329108.121  329033.495  317613.103  316917.054
## BIC                                        337531.372  329145.516  329080.238  317687.893  317047.935
## N                                           84853       84853       84853       84853       84853    
## =====================================================================================================

The best model is m5, as it has the highest R-squared (0.566) and the lowest AIC and BIC. The biggest improvement comes from adding EstimatedEffectiveYield, EstimatedLoss and EstimatedReturn in m4, which raises R-squared from 0.499 to 0.562.
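
Because R-squared can never decrease when a predictor is added to a least-squares model, it is worth checking AIC and BIC alongside it when comparing nested models. The sketch below shows one way to build such a comparison in base R. The data are simulated, and the names `toy`, `x1`, `x2` and `x3` are hypothetical; it is not the code that produced the table above.

```r
# Simulated data (NOT Prosper data): y depends on x1 and x2 but not on x3.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 6 - 1.5 * x1 + 0.8 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Nested models, each adding one predictor to the previous one.
m1 <- lm(y ~ x1, data = toy)
m2 <- update(m1, . ~ . + x2)
m3 <- update(m2, . ~ . + x3)
models <- list(m1 = m1, m2 = m2, m3 = m3)

# One row per model with the fit statistics used for comparison.
fit_stats <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
print(round(fit_stats, 3))
```

In this toy example `x3` is pure noise, so any gain in R-squared from `m3` is negligible, while AIC and BIC penalize the extra parameter.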

Let's evaluate its performance by checking the predicted ProsperScore and the actual ProsperScore.

## 
##     0     1 
## 68208 16645

The model performs poorly: it predicts the exact ProsperScore for only about 20% of loans (16,645 of 84,853). To investigate why it underperforms, I will plot the distributions of the actual and predicted values.
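
The hit rate follows directly from the 0/1 counts printed above. The sketch below recomputes it and also applies two small helper functions to the first ten actual/predicted pairs shown in the listing that follows. The helpers `exact_match_rate` and `within_one_rate` are hypothetical names introduced here for illustration, and the "within one point" measure is an added, more lenient check that is not part of the original analysis.

```r
# Counts from the table above: 0 = prediction differs, 1 = exact match.
matches  <- c(`0` = 68208, `1` = 16645)
accuracy <- matches[["1"]] / sum(matches)
print(round(accuracy, 3))

# Hypothetical helpers: share of exact matches, and share of predictions
# within one point of the actual score, ignoring rows with missing values.
exact_match_rate <- function(actual, predicted) {
  ok <- !is.na(actual) & !is.na(predicted)
  mean(round(predicted[ok]) == actual[ok])
}
within_one_rate <- function(actual, predicted) {
  ok <- !is.na(actual) & !is.na(predicted)
  mean(abs(round(predicted[ok]) - actual[ok]) <= 1)
}

# First ten rows of the ProsperScore / ProsperScore.pred listing.
actual    <- c(NA, 7, NA, 9, 4, 10, 2, 4, 9, 11)
predicted <- c(NA, 7, NA, 7, 4,  6, 3, 5, 8,  8)
print(exact_match_rate(actual, predicted))
print(within_one_rate(actual, predicted))
```

An exact match is a strict criterion for an 11-level score predicted by a continuous model, so a tolerance-based measure gives a fuller picture of how far off the predictions are.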

##    ProsperScore ProsperScore.pred
## 1            NA                NA
## 2             7                 7
## 3            NA                NA
## 4             9                 7
## 5             4                 4
## 6            10                 6
## 7             2                 3
## 8             4                 5
## 9             9                 8
## 10           11                 8
## 11            7                 6
## 12           NA                NA
## 13            4                 6
## 14            8                 7
## 15            8                 7
## 16            5                 2
## 17            4                 4
## 18           NA                NA
## 19            7                 7
## 20            8                 4
## 21            7                 6
## 22           NA                NA
## 23            2                 2
## 24            5                 4
## 25            5                 5
## 26            3                 3
## 27            3                 4
## 28            9                 8
## 29            4                 5
## 30            6                 7
## 31            9                 7
## 32            5                 2
## 33            8                 6
## 34           10                 9
## 35            5                 5
## 36            8                 6
## 37            2                 3
## 38            6                 4
## 39            9                 8
## 40           NA                NA
## 41            4                 6
## 42            8                 7
## 43           NA                NA
## 44            6                 5
## 45            5                 5
## 46            7                 7
## 47           NA                NA
## 48            8                 7
## 49            6                 7
## 50           10                 7

plot of chunk modelling3

The predicted values fall almost entirely between 3 and 9, with few at 1-2 or 10-11. This may be because the distribution of ProsperScore is itself concentrated between 3 and 9. A linear model may also be insufficient to explain all of the variance in ProsperScore.
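
One general reason for this clustering is that, for a least-squares fit with an intercept, the variance of the fitted values equals R-squared times the variance of the response. With R-squared of about 0.566, the predictions would have roughly sqrt(0.566) ≈ 0.75 of the standard deviation of the actual scores, so the extremes are rarely reached. The sketch below checks this identity on simulated data; the variables `x` and `y` are hypothetical and only mimic a score on a 1-11 scale.

```r
# Simulated score-like response on a 1..11 scale (NOT Prosper data).
set.seed(7)
n <- 2000
x <- rnorm(n)
y <- pmin(pmax(round(6 + 1.8 * x + rnorm(n, sd = 1.6)), 1), 11)

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

# For OLS with an intercept: var(fitted values) / var(response) = R-squared.
ratio <- var(fitted(fit)) / var(y)
print(c(r.squared = r2, variance.ratio = ratio))

# The predictions are therefore less spread out than the actual values.
print(c(sd.actual = sd(y), sd.predicted = sd(fitted(fit))))
```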

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I plotted the BorrowerAPR against ProsperScore colored by ProsperRating..Alpha. and noticed there seems to be different linear relationships within each Prosper Rating tier. For high range ratings “AA” to “B”, the correlation is negative, and for lower range ratings “C” to “HR”, the correlation is positive. The BorrowerAPR and ProsperRating..Alpha. are highly correlated.

Also, looking at BorrowerAPR, LenderYield, EstimatedEffectiveYield, EstimatedLoss and EstimatedReturn, I found that BorrowerAPR and LenderYield are tightly correlated with each other (correlation of 0.99). CreditScoreRangeLower and ProsperRating..numeric. also appear to strengthen each other (correlation of 0.549).

Were there any interesting or surprising interactions between features?

It is interesting to notice that, for higher IncomeRange, BorrowerAPR decreases more slowly as ProsperScore increases. When income is at or above $75K, the correlation is near -0.45. This is surprising because I expected BorrowerAPR not to be very sensitive to the borrower's income.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model with the dataset. The main strength of the model is that it highlights which factors contribute to the ProsperScore. For example, in model m5, every 0.022 increase in BorrowerAPR lowers the ProsperScore by about 1 (0.022 * -44.557 ≈ -0.98), holding the other variables constant. However, the R-squared of this model is modest (~0.566), which means about 43% of the variance is unexplained.
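
The arithmetic behind this interpretation can be checked in a few lines. The coefficient and R-squared are copied from the m5 column of the model table; the variable names are introduced here only for illustration.

```r
# Values copied from the m5 column of the model table.
beta_apr  <- -44.557   # coefficient on BorrowerAPR
r_squared <- 0.566

# Change in ProsperScore for a 0.022 increase in BorrowerAPR,
# holding the other predictors fixed.
score_change <- 0.022 * beta_apr
print(round(score_change, 3))

# Increase in BorrowerAPR associated with a one-point drop in ProsperScore.
apr_per_point <- 1 / abs(beta_apr)
print(round(apr_per_point, 4))

# Share of the variance in ProsperScore left unexplained by m5.
unexplained <- 1 - r_squared
print(unexplained)
```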


Final Plots and Summary

Plot One

plot of chunk Plot_One

Description One

I have chosen this plot because it shows how the distribution of APR changes across Prosper Scores and how it relates to Prosper Rating.

This faceted histogram shows a clear pattern in the distribution of Borrower APR. As Prosper Score increases from 1 to 11, the overall APR distribution shifts from right to left. The spike at the lower Prosper Scores comes from “HR” loans, whose APR is mostly around 36%. Each Prosper Rating level corresponds to a different APR range: the higher the Prosper Rating, the lower the APR range. According to Prosper, APR is affected by Prosper Rating, which is an indicator of expected loss rates, whose base rate is partly determined by the in-house Prosper Score.

Plot Two

plot of chunk Plot_Two

Description Two

I have chosen this boxplot because it shows a positive correlation between monthly income and Prosper Score (though the correlation is weak, only around 0.084). Monthly income is generally higher at higher Prosper Scores. The only exception is Prosper Score 1, where monthly income is slightly higher than for scores 2-5, possibly because of income overstatement or unverifiable income.

Plot Three

plot of chunk Plot_Three

Description Three

I chose this time series plot because it tells a story about the Great Recession of 2008-2012.

The number of loans originated increased from 2006 and then fell to 0 at the end of 2008, as a result of the 2008 Global Financial Crisis. From the end of 2009, loan originations started to recover and climbed to 100 per day during the 3rd quarter of 2012, before dropping to 50 due to the 2012 Financial Crisis. From 2013, loans started to increase again and jumped from 50 per day to about four times that level over the next 12 months.
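
Daily and monthly loan counts like those described above can be derived by truncating each origination timestamp to its date and tabulating. The sketch below uses a few made-up timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as the date columns in the dataset summary; the `origination` vector is hypothetical, and this is not necessarily how the plot was built.

```r
# Illustrative timestamps (NOT taken from the dataset).
origination <- c("2013-10-02 00:00:00",
                 "2013-10-02 00:00:00",
                 "2013-10-03 00:00:00",
                 "2013-11-15 00:00:00")

# Keep only the date part, then count loans per day and per month.
day       <- as.Date(substr(origination, 1, 10))
per_day   <- table(day)
per_month <- table(format(day, "%Y-%m"))

print(per_day)
print(per_month)
```

Counting per month instead of per day smooths out weekday effects, which can make long-run rises and falls easier to read.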


Reflection

The biggest challenge for this project is the selection of variables. There are many variables (81) in this dataset and it is quite hard to decide which variables to begin with. Fortunately, following the example project, I was able to create a correlation matrix which eased the variable selection process.

I soon realised that ProsperScore is highly correlated with interest-related variables such as BorrowerAPR and LenderYield. Before performing any analyses, I profiled the data, including the types and structure of the columns. After the univariate exploration, I had a general idea of the distribution of each variable of interest and discovered that many loans have a BorrowerAPR of around 0.36. I later found out that those loans have a ProsperRating..Alpha. of “HR”, which led me to explore this variable further. I was also able to tell the story of the financial crisis from the time series plot of LoanOriginationDate.

In the Bivariate Plots Section, I confirmed the negative correlation between BorrowerAPR and ProsperScore by looking at the jittered scatterplot and boxplot.

I still struggled to understand how the ProsperScore can be predicted from properly selected variables. Further effort could focus on analysing each variable's association with the ProsperScore quantitatively in order to improve the prediction model.